The Language Demographics of Amazon Mechanical Turk
Abstract
We present a large-scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers’ self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months, and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as a census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems.
Similar resources
Rating Computer-Generated Questions with Mechanical Turk
We use Amazon Mechanical Turk to rate computer-generated reading comprehension questions about Wikipedia articles. Such application-specific ratings can be used to train statistical rankers to improve systems’ final output, or to evaluate technologies that generate natural language. We discuss the question rating scheme we developed, assess the quality of the ratings that we gathered through Am...
Creating Speech and Language Data With Amazon's Mechanical Turk
In this paper we give an introduction to using Amazon’s Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL 2010 Workshop. Twenty-four researchers participated in the workshop’s shared task to create data for speech and language applications with $100.
Crowdsourcing for Language Resource Development: Critical Analysis of Amazon Mechanical Turk Overpowering Use
This article is a position paper about crowdsourced microworking systems and especially Amazon Mechanical Turk, the use of which has been steadily growing in language processing in the past few years. According to the mainstream opinion expressed in articles in this domain, this type of online working platform makes it possible to develop all sorts of quality language resources very quickly, for a ve...
Growing a Spoken Language Interface on Amazon Mechanical Turk
Typically data collection, transcription, language model generation, and deployment are separate phases of creating a spoken language interface. An unfortunate consequence of this is that the recognizer usually remains a static element of systems often deployed in dynamic environments. By providing an API for human intelligence, Amazon Mechanical Turk changes the way system developers can const...
Opportunities for Crowdsourcing Research on Amazon Mechanical Turk
Many crowdsourcing studies have been conducted that utilize Amazon Mechanical Turk, a crowdsourcing marketplace platform. The Amazon Mechanical Turk team proposes that comprehensive studies in the areas of HIT design, workflow and reviewing methodologies, and compensation strategies will benefit the crowdsourcing field by establishing a standard library of repeatable patterns and protocols.
Journal: TACL
Volume: 2
Publication date: 2014